Can Star Wars-related survey responses be used to predict whether a person earns more than $50,000 annually??
Data Loading & Missing Value Analysis
Show the code
# First 2 rows are column headers, created a map to make it easy to readnew_col_names = ["id", "seen_any", "fan_starwars", "seen_Ep_I", "seen_Ep_II", "seen_Ep_III", "seen_Ep_IV", "seen_Ep_V", "seen_Ep_VI", "rank_Ep_I", "rank_Ep_II", "rank_Ep_III", "rank_Ep_IV", "rank_Ep_V", "rank_Ep_VI", "fav_han", "fav_luke", "fav_leia", "fav_anakin", "fav_obi", "fav_palpatine", "fav_darth", "fav_lando", "fav_boba", "fav_c3po", "fav_r2", "fav_jar", "fav_padme", "fav_yoda", "who_shot_first", "familiar_expanded_universe", "fan_expanded_universe", "fan_startrek", "gender", "age", "income", "educ", "region"] df = pd.read_csv("https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv", encoding_errors="ignore", names = new_col_names, skiprows =2) display(df.head())
id
seen_any
fan_starwars
seen_Ep_I
seen_Ep_II
seen_Ep_III
seen_Ep_IV
seen_Ep_V
seen_Ep_VI
rank_Ep_I
...
fav_yoda
who_shot_first
familiar_expanded_universe
fan_expanded_universe
fan_startrek
gender
age
income
educ
region
0
3292879998
Yes
Yes
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
Star Wars: Episode IV A New Hope
Star Wars: Episode V The Empire Strikes Back
Star Wars: Episode VI Return of the Jedi
3.0
...
Very favorably
I don't understand this question
Yes
No
No
Male
18-29
NaN
High school degree
South Atlantic
1
3292879538
No
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
...
NaN
NaN
NaN
NaN
Yes
Male
18-29
$0 - $24,999
Bachelor degree
West South Central
2
3292765271
Yes
No
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
NaN
NaN
NaN
1.0
...
Unfamiliar (N/A)
I don't understand this question
No
NaN
No
Male
18-29
$0 - $24,999
High school degree
West North Central
3
3292763116
Yes
Yes
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
Star Wars: Episode IV A New Hope
Star Wars: Episode V The Empire Strikes Back
Star Wars: Episode VI Return of the Jedi
5.0
...
Very favorably
I don't understand this question
No
NaN
Yes
Male
18-29
$100,000 - $149,999
Some college or Associate degree
West North Central
4
3292731220
Yes
Yes
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
Star Wars: Episode IV A New Hope
Star Wars: Episode V The Empire Strikes Back
Star Wars: Episode VI Return of the Jedi
5.0
...
Somewhat favorably
Greedo
Yes
No
No
Male
18-29
$100,000 - $149,999
Some college or Associate degree
West North Central
5 rows × 38 columns
The original Star Wars survey dataset from FiveThirtyEight had two header rows and inconsistent column names. To make the data more usable, I recreated the column names to be more user-friendly.
Example: instead of long survey text like “Which of the following characters is your favorite? - Han Solo”, I renamed it to simply fav_han.
I also skipped the first two rows to avoid duplicated headers and formatting artifacts.
Now the dataset loads cleanly into a DataFrame with 38 columns covering Star Wars movie viewership, rankings, favorite characters, and demographic information (gender, age, income, education, region).
From here, I’ll inspect the structure, data types, and missing values to prompt the direction of this project.
Show the code
datatypes = df.dtypes# Percent missing valuesmissing = (df.isna().sum() /len(df) *100).round(1)display(f"Top 10 Missing Values Percentages of Columns")display(missing.sort_values(ascending=False).head(10)) # Show top 10
The columns with the highest percentage of missing values reveal clear patterns in survey behavior:
Expanded Universe fandom (fan_expanded_universe) is missing in 82% of responses. This suggests most participants were either unfamiliar with or chose not to answer questions about the broader Star Wars canon.
Movie viewing history (seen_Ep_I through seen_Ep_VI) shows between 36–54% missingness. This likely reflects that many respondents did not watch certain films and skipped those questions.
Favorite character fields such as fav_boba, fav_palpatine, and fav_padme each have around 31% missingness, consistent with respondents who had not seen the films and therefore did not select a favorite.
These missing values are not random (NMAR) but tied to the survey’s logic. Respondents who had not seen the movies were more likely to leave rankings and character preferences blank.
Because the missing values reflect survey behavior rather than data entry errors, they will need to be handled thoughtfully (e.g., recoding to “Not seen” or “No favorite”). However, before diving into a full cleaning process, it is important to first explore the structure of the variables themselves.
The next step will be to examine the nature of the target variable and key predictors in order to determine which model(s) will be best suited for this project.
Exploratory Data Analysis
Show the code
# Grab value counts and Get proportion of column it representsincome_counts = df['income'].value_counts(dropna=False)income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)# Display the counts and proportions as a tabledisplay(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))
Count
Proportion
income
NaN
328
0.28
$50,000 - $99,999
298
0.25
$25,000 - $49,999
186
0.16
$100,000 - $149,999
141
0.12
$0 - $24,999
138
0.12
$150,000+
95
0.08
The raw income variable contains six categories plus a substantial amount of missing data. This distribution shows two important points:
The variable, income, is categorical, not continuous — respondents selected income brackets, not exact dollar values. The distribution is fairly even, with the 50k-100k group the most common and 0-25k least common.
Therefore, I will encode the variables as 0, for <=50k, and 1, for >50k
This binning aligns with the project goal: predicting whether someone earns more than $50k annually. That decision makes logistic regression the appropriate baseline model, rather than linear regression.
Show the code
# Convert Income ranges to values 0 for less than 50kincome_map = {'$0 - $24,999': 0,'$25,000 - $49,999': 0,'$50,000 - $99,999': 1,'$100,000 - $149,999': 1,'$150,000+': 1}# Replacedf.loc[:,'income'] = df.loc[:,'income'].replace(income_map)# Grab value counts and Get proportion of column it representsincome_counts = df['income'].value_counts(dropna=False)income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)# Display the counts and proportions as a tabledisplay(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))
Count
Proportion
income
1.0
534
0.45
NaN
328
0.28
0.0
324
0.27
The new binning, makes the income variable fairly balanced, weighting towards over 50k having more representation as to <=50k.
I will now move on converting string-ordinal columns, to numeric-ordinal column for the purpose of visualizing, finding correlations, and preparing for modeling.
Show the code
# Convert age to numeric, picking the median of each, 66 for >60 because 72 is average life expectancyage_map = {'18-29': 23, '30-44': 37, '45-60': 52.5, '> 60': 66}# Replacedf['age'] = df['age'].map(age_map)# Convert fav_ columns to number rankfav_map = {'Very favorably': 2,'Somewhat favorably': 1,'Neither favorably nor unfavorably (neutral)': 0,'Unfamiliar (N/A)': 0,'Somewhat unfavorably': -1,'Very unfavorably': -2}# Replacedf.iloc[:, list(range(15,29))] = df.iloc[:, list(range(15,29))].replace(fav_map)
Show the code
# Select favorability columns by name pattern and change dtype for correlation laterfav_cols = [col for col in df.columns if'fav_'in col]df[fav_cols] = df[fav_cols].astype('float64')rank_cols = [col for col in df.columns if'rank_'in col]df[rank_cols] = df[rank_cols].astype('Int64')
Show the code
# Select all numeric columnsnumeric_cols = [col for col in df.columns if df[col].dtype in ['int64', 'int8', 'Int64', 'Int8', 'float64']]# Calculate correlations with incomecorrelations = df[numeric_cols + ['income']].corrwith(df['income'])correlations = correlations.drop('income').sort_values(key=abs, ascending=False)print("Top 5 correlations with income (>$50k):")display((correlations.head(5)))print("\nBottom 5 correlations:")display(correlations.tail(5))df['income'] = df['income'].astype('bool')
From the aforementioned correlations, we can see certain characters, and movies that have a small correlation around 10% with income.
Age, had somewhat to do as most would have guessed, with a 12% correlation with income.
The main standout from this, is no one variable is worth exploring as being a predictor of income by itself, but rather an interaction of multiple variables.
In the coming graphs, I am looking for different distributions, which will tell me there is a meaningful relationship between 2 variables on income.
Show the code
( ggplot(df, aes(x="fan_startrek", fill="fan_starwars")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Star Trek Fan", y="Count", fill="Star Wars Fan" ))
Show the code
( ggplot(df, aes('rank_Ep_II')) + geom_bar() + facet_wrap(facets='income') + labs( title="Proportion of Star Trek Fans Across Income Levels", x="Age", y="Count" ))
Show the code
top_regions = df['region'].value_counts().nlargest(6).indexdf2 = dfdf2['region_plot'] = np.where(df2['region'].isin(top_regions), df2['region'], 'Other')( ggplot(df2, aes(x="region_plot")) + geom_bar() + coord_flip() + facet_grid(y='income', x='gender') + labs(title="Region Distribution by Income and Gender", x="Region", y="Income"))
( ggplot(df.dropna(subset=['age']), aes(x="income", y="age")) + geom_boxplot() + coord_cartesian(ylim=(15, 100)) + labs(title="Age by Income Bracket", x="Income", y="Age"))
Show the code
( ggplot(df, aes(x="educ", fill="income")) + geom_bar(position="fill") + coord_flip() + labs( title="Education and Income (Proportions)", x="Education Level", y="Proportion", fill="Income" ))
As I explored multi layered relationships, a few interesting things occured:
The interaction between gender, and education show differing distributions, perhaps uncovering a good predictor of income.
The interactions between age and education also added to a differing distribution.
All in all, the distributions are more similar than different across almost all interactions explored.
Therefore, I will entertain multiple tree models, including Random Forest, K-Nearest Neighbors, Gradient Boosting, Decision Tree, and NB booster. This project is limited by my knowledge, which only includes these ML algorithms as of today.
Model Prep
Show the code
# factorize into 1s 0s education columndf['educ'] = df['educ'].astype('category')df['educ'] = pd.factorize(df['educ'])[0]df['educ'] = df['educ'].astype('category')# Drop ID columndf1 = df.drop(df.columns[0], axis=1)# Separate columns to exclude from one-hot encodingencode_mask = df1.columns.str.contains(r'fav_|income|educ|age', case=False)new_encode = df1.loc[:, ~encode_mask].astype("category")old_encode = df1.loc[:, encode_mask]# One-hot encode the remaining categorical/object columnssw_encoded = pd.get_dummies(new_encode, dummy_na=True, prefix_sep='_', drop_first=False, dtype=int)sw_encoded.columns = sw_encoded.columns.str.replace(' ', '_')# Concatenate back the excluded columnssw = pd.concat([df.iloc[:,0], sw_encoded, old_encode], axis=1)# Drop rows with 7 or more nansw = sw[sw.isna().sum(axis=1) <=7]# Drop rows with nan in outcome variablesw = sw.dropna(subset=['income'])# drop ID columnsw = sw.drop('id', axis=1)
To begin, I transformed age into numeric midpoints and randomly assigned values between 60 and 80 for respondents who reported > 60. The income variable was recoded into a binary classification target, where:
\texttt{income} =
\begin{cases}
0 & \text{if income } < \$50{,}000 \\\\
1 & \text{if income } \geq \$50{,}000
\end{cases}
Favorability ratings for characters (columns prefixed with fav_) were mapped to a numerical scale from -2 (very unfavorable) to +2 (very favorable). I excluded high-cardinality columns from one-hot encoding (fav_, income, educ, and age), then encoded the remaining categorical variables using pd.get_dummies() with dummy_na=True. I then dropped rows with 7 or more missing values or any missing values in the target variable.
Show the code
display(f"{len(df1)} Observations")display(f"Percent of Observations Missing")display((df1.isna().sum() /len(df1) *100).sort_values(ascending=False))
After preprocessing, I assigned the predictors to x and the binary target variable to y = income. The dataset was split into training and test sets using a 75/25 ratio. Missing values in x were either filled with 0 (for models that require complete input like GradientBoostingClassifier and GaussianNB) or left as-is if the model could tolerate it.
I trained and evaluated the following classifiers:
RandomForestClassifier
GradientBoostingClassifier
DecisionTreeClassifier
GaussianNB (Naive Bayes)
Each model’s performance was evaluated using accuracy_score and classification_report from sklearn.metrics.
Show the code
# Assign Predictors and outcome variablesx = sw.drop(columns='income')y = sw['income'].astype(int)
Show the code
# Split dataset into training and testing setsx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=50)# Split and fill na's for gradient/Gassianx_train_na = x_train.fillna(0)x_test_na = x_test.fillna(0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Random Forest achieved the highest overall accuracy at 69.0%, with strong recall (91%) and F1-score (0.81) for predicting individuals earning at least $50k. However, it performed very poorly on the lower-income class with just 22% f1 and 15% recall.
Gradient Boosting followed with an accuracy of 67.8%. It showed identical results to random forest as far as recall, precision, and f1 go. In essence, great at predicting the true cases, and terrible at the untrue cases.
** Decision Tree classifier follows the same pattern overall as Random Forest and Gradient Boosting Classifiers.
Gaussian Naive Bayes reached 28..8% accuracy but completely failed to classify high-income respondents, with 1% precision. It was heavily biased toward predicting class 2 only.
**K-Neighbors, follows a relatively similar pattern, with 67% accuracy and a little more balanced recall and f1 scores.
Conclusion and Recommendation
While Gradient Boosting offered a more balanced performance across both classes, I believe the Random Forest Classifier is the better choice in this context. Since the primary goal is to accurately predict whether someone earns more than $50,000, overall accuracy and precision for the high-income class matter more than recall for the low-income group. Random Forest achieved the highest accuracy and the strongest precision for class 1, making it more effective at identifying higher earners — which aligns directly with the objective of this analysis.
Feature Importance in Gradient Boosting Results
After training the RandomForestClassifier, I examined which features contributed most to the model’s predictions.
Using the feature_importances_ attribute of the trained model, I identified the top predictors for determining whether a respondent earns over $50,000. These features reflect a combination of demographic factors and Star Wars-related preferences that the model found most useful for predicting income level.
Here are the top 10 most important features:
Show the code
# Get top 10 features by importanceimportances = classifier_DT.feature_importances_features = x_train.columnsimportance_df = pd.DataFrame({'Feature': features, 'Importance': importances})top_features = importance_df.sort_values(by='Importance', ascending=False).head(10)ggplot(top_features, aes(x='Feature', y='Importance', fill='Feature')) +\ geom_bar(stat='identity', color='black', size=0.4) +\ ggtitle('Feature Importance - Random Forest') +\ xlab('Feature') +\ ylab('Importance') +\ theme_minimal() +\ theme( axis_text_x=element_text(angle=45, hjust=1, size=12), axis_text_y=element_text(size=12), plot_title=element_text(size=16, face='bold', hjust=0.5), legend_position='none' )
In the Random Forest model, feature importance was more evenly distributed across a mix of predictors. Among the features, age and educ were most important, which aligns with expected links between demographics and income. Several Star Wars favorability ratings — including fav_jar, fav_darth, fav_anakin, and fav_padme — also contributed meaningfully. This broader distribution of importance suggests the model’s draws on a variety of demographic and preference-based factors.
In conclusion, it seems as if star wars related preferences and education and age, are not very powerful in predicting income being greater than 50k or not.